The directory contains four files: two for Fiction and two for Non-Fiction. This readme describes the columns of each file and gives other useful information on the filtering heuristics.

We also make the accompanying code publicly available at https://github.com/sunyam/million-page-hathi. Please use this code as documentation of what we did, and not as a reproducible workflow.



################ Metadata ################

The metadata files (Fiction_metadata.tsv and NONfiction_metadata.tsv) contain the metadata for all fiction and non-fiction volumes in our dataset. The columns include:

- htid: Unique and permanent Hathi Trust item identifier.

- year: Publication date of the volume. Derived from the “rights_date_used” field in Hathifiles.

- title: Title of the work; may include author information. Derived from the “title” field in Hathifiles.

- author: The name of the person, company or meeting that created the work. Derived from the “author” field in Hathifiles.

- page_numbers: List of page numbers from the corresponding volume that satisfied our sampling criteria. 

- page_numbers_str: List of the pages’ filenames (strings) as they appear when downloaded in the HTRC Capsule. Covers the same pages as page_numbers.

See https://www.hathitrust.org/hathifiles_description for a more detailed description of the Hathifiles fields.
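
The list-valued columns may need to be parsed after the file is loaded. The following is a minimal sketch, assuming that a page_numbers cell is stored as the text of a Python-style list such as "[12, 13, 608]". That serialization is an assumption and is not stated in this readme, so inspect a few rows before relying on it. The helper name parse_page_list and the example cell are hypothetical.

```python
import ast

def parse_page_list(cell):
    """Parse one page_numbers cell, ASSUMED to look like "[12, 13, 608]"."""
    pages = ast.literal_eval(cell)  # safe parsing of a literal, no code execution
    if not isinstance(pages, (list, tuple)):
        raise ValueError(f"expected a list, got {type(pages).__name__}")
    return [int(p) for p in pages]

# Hypothetical cell value, for illustration only.
example_cell = "[12, 13, 608]"
print(parse_page_list(example_cell))
```

If the cells turn out to use a different format, only the body of parse_page_list needs to change.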



################ Enriched Features ################

The Enriched-Feature-Set files (Enriched_Feature_Set_Fiction.csv and Enriched_Feature_Set_NONfiction.csv) contain the computed features for all fiction and non-fiction pages in our dataset. The columns include:

- TXT_FILENAME: Unique identifier for each page in our dataset. This field contains information about the volume (htid) as well as the page number. This information is extracted and provided as two separate columns described below: “htid” and “page”. For example, if the TXT_FILENAME is “uc1.b4456456____PAGE____00000608_clean.txt”, the corresponding ‘htid’ and ‘page’ columns would be “uc1.b4456456” and “608” respectively.

- htid: HTID of the volume that this page comes from. Note that this field can be used to index into the metadata files (or Hathifiles) to extract more information about the volume.

- page: Page number.

- Year: Publication date of the volume.

- TotalLines: The number of text lines on that page.

- TotalWords: The number of words (lexemes) on that page. This does not include punctuation.

- TotalTokens: The number of total tokens (including punctuation) on that page.

- AvgSentlen: Average sentence length on that page. It is defined as the ratio of the total number of words to the total number of sentences.

- PctDialog: The amount of dialogue on the page. It is computed as the ratio of the number of words in quotes to the total number of words on the page.

- Tuldava: Tuldava readability score for that page.

- VADER_*: The sentiment scores for that page as returned by VADER, a popular lexicon and rule-based sentiment analysis tool. It returns four scores that are included as four separate columns in our dataset:
  - VADER_compound: a normalized weighted composite score between -1 (most extreme negative) and +1 (most extreme positive)
  - VADER_neg: the proportion of the page’s text rated negative by VADER
  - VADER_neu: the proportion of the page’s text rated neutral by VADER
  - VADER_pos: the proportion of the page’s text rated positive by VADER
  The VADER_pos, VADER_neg and VADER_neu columns should sum to 1. See VADER’s documentation for more details: https://github.com/cjhutto/vaderSentiment

- NRC_*: The emotion scores per page computed using the NRC Word-Emotion Association Lexicon. The lexicon is available at https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm. It consists of a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and two sentiments (positive, negative). We compute a score for each of these ten categories and provide them as separate columns, namely: NRC_anger, NRC_anticipation, NRC_disgust, NRC_fear, NRC_joy, NRC_negative, NRC_positive, NRC_sadness, NRC_surprise, NRC_trust. Note that each of these columns is normalized by the total number of words on the page.
    
- Part-of-speech tags: We process our pages through BookNLP and release the frequency of the part-of-speech tags (Penn Treebank Project) for each page. The frequency is normalized by the total number of words on the page. These include the following columns: #,$,'',",",-LRB-,-RRB-,.,:,CC,CD,DT,EX,FW,IN,JJ,JJR,JJS,LS,MD,NN,NNP,NNPS,NNS,O,PDT,POS,PRP,PRP$,RB,RBR,RBS,RP,SYM,TO,UH,VB,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WP$,WRB,``. See https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a complete description of these tags.

- Supersense tags: We include the frequency of the BookNLP supersense tags for each page. These are of the form “noun.*”, “verb.*” and “O” tags. The complete list can be accessed at: https://wordnet.princeton.edu/documentation/lexnames5wn. Note that the frequency is normalized by the total number of words.
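
Because the NRC, part-of-speech and supersense columns are normalized by the number of words on the page, an approximate raw count can be recovered by multiplying by the word count. The following is a minimal sketch on a single invented page. It assumes that the denominator is the TotalWords column, which this readme implies but does not state outright. The helper names approx_count and vader_sums_to_one are hypothetical, and the tolerance of 0.01 is an arbitrary choice to allow for rounding in the stored scores.

```python
# One page as a plain dict; every value below is invented for illustration.
page = {"TotalWords": 250, "NRC_joy": 0.02, "NN": 0.16,
        "VADER_pos": 0.10, "VADER_neu": 0.85, "VADER_neg": 0.05}

def approx_count(row, column):
    """Undo the per-word normalization (ASSUMES the denominator is TotalWords)."""
    return round(row[column] * row["TotalWords"])

def vader_sums_to_one(row, tol=0.01):
    """Check that pos + neu + neg is within `tol` of 1."""
    total = row["VADER_pos"] + row["VADER_neu"] + row["VADER_neg"]
    return abs(total - 1.0) <= tol

print(approx_count(page, "NRC_joy"), approx_count(page, "NN"))
print(vader_sums_to_one(page))
```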


These features were computed using BookNLP (Java version): https://github.com/dbamman/book-nlp.
Note that a few thousand pages (~10k) could not be processed by BookNLP and therefore do not have a corresponding row in the Enriched-Feature tables but do exist in the metadata tables.
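
The htid and page columns already hold the values encoded in TXT_FILENAME, so the filename only needs to be parsed when working from the page files directly. The following is a minimal sketch. The pattern is inferred from the single example given above ("uc1.b4456456____PAGE____00000608_clean.txt") and is an assumption; other filenames may differ. The helper name split_txt_filename is hypothetical, and returning the page as an integer is a choice made here.

```python
import re

# ASSUMED pattern, inferred from the one example filename in this readme.
FILENAME_RE = re.compile(r"^(?P<htid>.+)____PAGE____(?P<page>\d+)_clean\.txt$")

def split_txt_filename(name):
    """Return (htid, page) for a TXT_FILENAME value, or raise ValueError."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected TXT_FILENAME format: {name!r}")
    # int() drops the zero padding, e.g. "00000608" becomes 608.
    return match.group("htid"), int(match.group("page"))

htid, page = split_txt_filename("uc1.b4456456____PAGE____00000608_clean.txt")
print(htid, page)
```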



################ Scripts ################

# To read the metadata file as a Pandas DataFrame (python)
import pandas as pd
meta = pd.read_csv('/path/Fiction_metadata.tsv', delimiter='\t')

# To read the enriched-features file as a Pandas DataFrame (python)
feats = pd.read_csv('/path/Enriched_Feature_Set_Fiction.csv')

## Note that some column names might contain special characters such as `` or $. If desired, you can modify the column-names using:
## (regex=False treats `` and $ as literal text; read as a regular expression, $ would match the end of every column name)
feats.columns = feats.columns.str.replace('``', 'DoubleQuotes', regex=False) ## to substitute `` with 'DoubleQuotes' in column-names
feats.columns = feats.columns.str.replace('$', 'DollarSign', regex=False) ## to substitute $ with 'DollarSign' in column-names
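
Since htid appears in both tables, volume-level metadata can be attached to each page row with a join. The following is a minimal sketch on two tiny stand-in tables. Every value in them is invented, including the htid "mdp.0001" and the titles and authors. With the real files, meta and feats would come from pd.read_csv instead.

```python
import pandas as pd

# Tiny stand-ins for the real tables; every value here is invented.
meta = pd.DataFrame({
    "htid": ["uc1.b4456456", "mdp.0001"],
    "title": ["Example title A", "Example title B"],
    "author": ["Author A", "Author B"],
})
feats = pd.DataFrame({
    "htid": ["uc1.b4456456", "uc1.b4456456"],
    "page": [608, 609],
    "TotalWords": [250, 240],
})

# Left join: keep every page row and attach its volume-level metadata.
# validate raises an error if htid is not unique in the metadata table.
pages = feats.merge(meta[["htid", "title", "author"]], on="htid",
                    how="left", validate="many_to_one")

# Volumes listed in the metadata that have no feature rows at all.
missing = meta.loc[~meta["htid"].isin(feats["htid"]), "htid"].tolist()

print(pages)
print(missing)
```

The second step only finds volumes with no feature rows. Finding individual missing pages would additionally require expanding the page_numbers column.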



################ Heuristics ################

For a more conservative filtering of non-prose literary works, we use a heuristic set of title and genre keywords to remove inappropriate volumes from each of our classes prior to classification. For both Fiction and Non-Fiction, we discard any volume that has any of the following word(s) in its title:

- poet* (poetry, poetical, poets etc.)
- poem* (poem, poems etc.)
- drama*
- magazine
- “in [blank] act[s]”
- “in verse”
- “a comedy”
- “a tragedy”
- “report of”
- “trial of”

Additionally, for Non-Fiction, we remove dictionaries using two approaches:
- Discard any volume that has any of the following words in its title: “register”, “index”, “dictionary”, “lexicon”, “directory”.
- Rank all the pages by average sentence length and by the number of punctuation marks. Then discard the outliers, that is, any volume with a page that ends up in the top or bottom 250 pages of the two distributions.
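
The outlier step can be illustrated on a handful of invented pages. The following is a minimal sketch and not the code used to build the dataset. It assumes that a volume is discarded when any of its pages falls among the k lowest or k highest pages of either ranking, and that ties are broken by sort order; both are assumptions. The field name "n_punct", the helper outlier_volumes and all values are hypothetical.

```python
def outlier_volumes(pages, key, k):
    """Return htids owning any of the k lowest or k highest pages by `key`."""
    if k <= 0:
        return set()
    ranked = sorted(pages, key=lambda p: p[key])
    extremes = ranked[:k] + ranked[-k:]
    return {p["htid"] for p in extremes}

# Invented page records; "n_punct" is a hypothetical field name.
pages = [
    {"htid": "vol_a", "avg_sentlen": 2.0, "n_punct": 30},
    {"htid": "vol_b", "avg_sentlen": 18.0, "n_punct": 25},
    {"htid": "vol_c", "avg_sentlen": 21.0, "n_punct": 28},
    {"htid": "vol_d", "avg_sentlen": 24.0, "n_punct": 400},
    {"htid": "vol_e", "avg_sentlen": 95.0, "n_punct": 27},
]

K = 1  # the heuristic above uses 250 on the full dataset
flagged = outlier_volumes(pages, "avg_sentlen", K) | outlier_volumes(pages, "n_punct", K)
kept = [p for p in pages if p["htid"] not in flagged]

print(sorted(flagged))
print([p["htid"] for p in kept])
```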